SOME COMMENTS ON THE MINIMUM MEAN SQUARE
ERROR AS A CRITERION OF ESTIMATION


C.  Radhakrishna  Rao 


Abstract. It is shown that estimators obtained by MMSE
(minimizing the mean square error) may not have optimum
properties with respect to other criteria such as PN
(probability of nearness to the true value in the sense
of Pitman) or PC (probability of concentration around
the true value). In particular, a detailed study is
made of estimators obtained by shrinking the minimum
variance unbiased estimators to reduce the MSE. It is
suggested that because of mathematical convenience and
some intuitive considerations, MMSE could be used as a
primitive postulate to derive estimators, but their
acceptability should be judged on more intrinsic criteria
such as PN and PC.


AMS(MOS)  Subject  Classification:  62F10,  62F15 

Key  Words  and  Phrases:  Inverse  regression,  James-Stein 
estimator,  Minimum  mean  square  error,  Shrunken  estimator. 


1.  INTRODUCTION 


The concept of minimum mean square error (MMSE) as a
criterion of estimation is attributed to Gauss and figures
prominently in the discussion of problems of statistical
estimation. No doubt, the criterion is a valid one if the
problem of estimation is considered in a decision theoretic
framework with the loss function specified as the square
of the error in an estimator. Otherwise, the criterion
is arbitrary, as Gauss himself observed in a paper
presented to the Royal Society of Göttingen in 1809:

"From the value of the integral $\int_{-\infty}^{\infty} x\,\phi(x)\,dx$, i.e.,
the average value of $x$ (defined as deviation in the
estimator from the true value of the parameter) we learn
the existence or non-existence of a constant error as
well as the value of this error; similarly, the integral
$\int_{-\infty}^{\infty} x^2\phi(x)\,dx$, i.e., the average value of $x^2$, seems very
suitable for defining and measuring, in a general way,
the uncertainty of a system of observations. ... If one
objects that this convention is arbitrary and does not
appear necessary, we readily agree. The question which
concerns us here has something vague about it from its
very nature, and cannot be made really precise except by
some principle which is arbitrary to a certain degree. ...
It  is  clear  to  begin  with  that  the  loss  should  not  be 
proportional  to  the  error  committed,  for  under  this 

hypothesis, since a positive error would be considered
as a loss, a negative error would be considered as a gain;
the magnitude of a loss ought, on the contrary, to be
evaluated by a function of the error whose value is always
positive. Among the infinite number of functions satisfying
this condition, it seems natural to choose the simplest,
which is, without doubt, the square of the error,
and in this way we are led to the principle proposed above".
Karlin  (1958)  expresses  the  same  opinion: 

"The  justification  for  the  quadratic  loss  as  a  measure 
of the discrepancy of an estimate derives from the following
two characteristics: (i) in the case where a(x) represents
an unbiased estimate of h(w), MSE may be interpreted
as the variance of a(x) and, of course, fluctuations as
measured by the variance is very traditional in the domain
of classical estimation, (ii) from a technical and mathematical
viewpoint square error lends itself most easily to
manipulation and computations".

Thus, the criterion of MMSE is used not because of its
practical relevance in a given problem but for its simplicity
and mathematical convenience. We may, therefore, accept MMSE
as a primitive postulate providing a rule of estimation like
other methods such as maximum likelihood, minimum chi-square,
etc., and examine the properties of estimators so obtained
in terms of other criteria.




The present study is limited to the examination of estimators
obtained by "shrinking" unbiased estimators with a
view to decreasing the MSE. We compare the shrunken estimator
with the unbiased estimator in terms of its bias (B), mean
absolute error (MAE), mean square error (MSE), mean quartic
error (MQE), and more intrinsic properties like the probability
of nearness to the true value (PN) due to Pitman (1937),
and probability of concentration in intervals round the true
value (PC).

In the discussion on a recent paper by Berkson (1980),
the author (Rao, 1980) has pointed out some anomalies that
may result from accepting MMSE as a criterion of estimation.
Examples were given of estimators which have a smaller MSE
but perform poorly in terms of more intrinsic criteria such
as PN and PC when compared to other estimators. Such anomalies
are expected since the quadratic loss function places
undue emphasis on large deviations which may occur with small
probability, and minimizing the MSE may insure against large
errors in an estimator occurring more frequently rather than
providing greater concentration of an estimator in neighborhoods
of the true value. A more detailed study of such situations
is made in the present paper.

2.  ESTIMATION  OF  A  SINGLE  PARAMETER 

Let $X$ be an unbiased estimator of a parameter $\theta$ with
$V(X) = \sigma^2$. It is well known that with respect to a quadratic
loss function, $cX$ is an admissible estimator of $\theta$ if
$0 < c \le 1$ (see Rao, 1976b for instance). The MSE of $cX$ is

$$E(cX-\theta)^2 = \sigma^2[c^2 + (1-c)^2\delta^2] \le E(X-\theta)^2 \tag{2.1}$$

iff $\delta^2 \le (1+c)/(1-c)$, where $\delta = \theta/\sigma$. Thus, if we have some
knowledge of $\delta$, we can make an appropriate choice of $c$ to
ensure the inequality in (2.1). The minimum of $E(cX-\theta)^2$
is attained at $c = \delta^2/(1+\delta^2)$, and if it is known that the
true $\delta$ is near about $\delta_0$, we may try the estimator

$$X_0 = \frac{\delta_0^2}{1+\delta_0^2}\,X \tag{2.2}$$

which has the property

$$E_2 = [E(X_0-\theta)^2/E(X-\theta)^2]^{1/2} \le 1 \quad \text{if } |\delta| \le (2\delta_0^2+1)^{1/2}. \tag{2.3}$$

But the property (2.3) does not ensure that

$$\mathrm{PN} = \Pr(|X_0-\theta| \le |X-\theta|) \ge 0.5 \tag{2.4}$$

for the same range of $\delta$. Table 1 gives the approximate
values of $\delta$ below which PN $\ge 0.5$ and $E_2 \le 1$ for different
values of the shrinkage factor $c = \delta_0^2/(1+\delta_0^2)$ and the
associated values of $\delta_0$.


TABLE 1


Values of $|\delta|$ below which $E_2 \le 1$ and PN $\ge 0.5$
for different shrinkage factors

Table 1 shows that the range of $\delta$ for which (2.4) holds is
much smaller than that for (2.3) to hold. It is also interesting
to note that the optimum choice of $c$ corresponding to a
given $\delta_0$ for reducing the MSE does not ensure that PN $\ge 0.5$
even for $\delta = \delta_0$ unless $\delta_0$ is below 1.2 (approximately).
Thus, shrinking an unbiased estimator is useful only when the
true value of the parameter under estimation is smaller than
about 1.2 times the standard error of estimation.
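The comparison between (2.3) and (2.4) can be checked numerically. The following is a minimal sketch, assuming that $X$ is normally distributed (an assumption not stated for Table 1 in the text above); the function names are illustrative only. Under normality, for $\theta > 0$ the event $|cX-\theta| \le |X-\theta|$ is the same as $X \le 0$ or $X \ge 2\theta/(1+c)$, which gives PN in closed form.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def shrink_factor(delta0):
    """MSE-optimal factor c = delta0^2 / (1 + delta0^2), as in (2.2)."""
    return delta0 ** 2 / (1.0 + delta0 ** 2)

def e2(delta, c):
    """Root relative MSE of cX versus X, from (2.1)."""
    return math.sqrt(c ** 2 + (1.0 - c) ** 2 * delta ** 2)

def pn(delta, c):
    """Pr(|cX - theta| <= |X - theta|) for X ~ N(theta, sigma^2),
    with delta = theta / sigma (normality is an assumption here)."""
    d = abs(delta)
    return phi(-d) + 1.0 - phi(d * (1.0 - c) / (1.0 + c))

if __name__ == "__main__":
    for delta0 in (0.5, 1.0, 1.2, 2.0):
        c = shrink_factor(delta0)
        print(f"delta0={delta0:.1f}  c={c:.3f}  "
              f"E2={e2(delta0, c):.3f}  PN={pn(delta0, c):.3f}")
```

Evaluating `pn(delta0, shrink_factor(delta0))` over a grid of `delta0` shows where PN at $\delta = \delta_0$ crosses 0.5.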

If $\sigma^2$, the variance of the estimator $X$, is unknown, but
an estimator $s^2$ of $\sigma^2$ is available, we can define an empirical
version of (2.2)

$$X_e = \frac{(X/s)^2}{1+(X/s)^2}\,X \tag{2.5}$$

and study its performance. The MSE of $X_e$ compared to that of
$X$ has been extensively studied by Thompson (1968) under various
distributional assumptions on $X$. We shall examine other properties
of (2.5) assuming that $X$ is normally distributed and
$\sigma$ is known. As shown by Thompson, the conclusions are not
likely to be different when $\sigma$ is used instead of $s$ in
(2.5) even for small values of $f$, the degrees of freedom on
which $\sigma^2$ is estimated.

Table 2 gives the values of

$$B = \sigma^{-1}E(X_e-\theta), \qquad \mathrm{PN} = \Pr(|X_e-\theta| \le |X-\theta|),$$

$$E_1 = \sigma^{-1}\sqrt{\pi/2}\;E|X_e-\theta|,$$

$$E_2 = \sigma^{-1}[E(X_e-\theta)^2]^{1/2},$$

$$E_4 = \sigma^{-1}[E(X_e-\theta)^4/3]^{1/4},$$

obtained by simulation. It is seen that the empirically shrunken
estimator $X_e$ is better than the unbiased estimator $X$ only when
$\delta \le 1.4$ (approximately), i.e., when the standard error of the
estimator of a parameter is more than 70% of the value of the
parameter. But a serious drawback of the estimator (2.5) may
be the large negative bias it has unless $\delta$ is very small or
very large.
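The quantities defined for Table 2 are easy to approximate by Monte Carlo. The sketch below assumes $\sigma = 1$ is known, so that $X_e = X^3/(1+X^2)$ with $X \sim N(\delta, 1)$; the sample size, the seed and the function name are illustrative choices and not those of the original study, so the output will only agree with Table 2 up to simulation error.

```python
import math
import random

def simulate_shrunken(delta, n_samples=20000, seed=0):
    """Monte Carlo summary of X_e = X^3 / (1 + X^2) for X ~ N(delta, 1)."""
    rng = random.Random(seed)
    s_bias = s_abs = s_sq = s_quart = 0.0
    closer = 0
    for _ in range(n_samples):
        x = rng.gauss(delta, 1.0)
        xe = x ** 3 / (1.0 + x ** 2)
        err = xe - delta
        s_bias += err
        s_abs += abs(err)
        s_sq += err ** 2
        s_quart += err ** 4
        # count the samples in which X_e is at least as near to theta as X
        closer += abs(err) <= abs(x - delta)
    n = float(n_samples)
    return {
        "B": s_bias / n,
        "E1": math.sqrt(math.pi / 2.0) * s_abs / n,
        "E2": math.sqrt(s_sq / n),
        "E4": (s_quart / n / 3.0) ** 0.25,
        "PN": closer / n,
    }

if __name__ == "__main__":
    for delta in (0.0, 1.0, 1.5, 3.0):
        print(delta, simulate_shrunken(delta))
```

The normalizing constants $\sqrt{\pi/2}$ and $1/3$ make $E_1$ and $E_4$ equal to one for the unbiased estimator $X$ itself, so values above one indicate that $X_e$ is worse than $X$ on that criterion.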


TABLE 2


Values of $E_1$, $E_2$, $E_4$, $E = (E_1+E_2+E_4)/3$, PN and $B$ for the
estimator $X^3/(s^2+X^2)$ for different values of $\delta = \theta/\sigma$.


    δ        B       E1      E2      E4      E       PN

    0.0    -.005    .549    .702    .813    .688   1.000
    0.5    -.171    .757    .764    .818    .780    .706
    1.0    -.290    .946    .893    .856    .898    .565
    1.4    -.344   1.047    .993    .919    .986    .504
    1.5    -.352   1.066   1.015    .938   1.006    .487
    2.0    -.367   1.124   1.097   1.031   1.084    .444
    2.5    -.350   1.136   1.131   1.091   1.120    .436
    3.0    -.317   1.124   1.133   1.121   1.126    .437
    3.5    -.283   1.105   1.118   1.118   1.114    .444
    4.0    -.252   1.086   1.100   1.102   1.096    .453
    8.0    -.130   1.026   1.037   1.033   1.023    .476
   10.0    -.105   1.018   1.028   1.026   1.024    .480
   20.0    -.055   1.007   1.017   1.015   1.013    .490
  100.0    -.015   1.003   1.013   1.012   1.009    .495


3. ESTIMATION OF VARIANCE

If $S^2$ denotes the corrected sum of squares of $n$ i.i.d.
observations from $N(\mu,\sigma^2)$, it is well known that $s^2 = S^2/(n-1)$
is the minimum variance unbiased estimator of $\sigma^2$. But
$s_2^2 = S^2/(n+1)$ has smaller MSE than $s^2$ uniformly for all $\sigma^2$ and
all $n$, so that $s^2$ is inadmissible as an estimator of $\sigma^2$ with
respect to the MSE criterion. How does $s_2^2$ compare with $s^2$
with respect to other criteria? Table 3 gives the values of
the following for different degrees of freedom $(n-1)$:

$$B = E(s_2^2 - \sigma^2)/\sigma^2,$$

$$\mathrm{PN} = \Pr(|s_2^2 - \sigma^2| \le |s^2 - \sigma^2|),$$

$$\mathrm{PC} = \Pr(-\log a \le \log s^2 - \log\sigma^2 \le \log a),$$

$$\mathrm{PC}_2 = \Pr(-\log a \le \log s_2^2 - \log\sigma^2 \le \log a),$$

$$E_2 = [E(s_2^2 - \sigma^2)^2/E(s^2 - \sigma^2)^2]^{1/2}.$$

It is seen that PN, the probability that $s_2^2$ is closer to $\sigma^2$
than $s^2$, is less than 0.5 uniformly for all $\sigma^2$ and for all $n$,
although $E_2$ is uniformly less than unity for all $\sigma^2$ and for
all $n$. Similarly, $\log s^2$ has a greater concentration probability
in any symmetrical interval around $\log\sigma^2$ than $\log s_2^2$
uniformly for all $\sigma^2$ and all $n$. Thus shrinking the unbiased
estimator $s^2$ has resulted in a smaller MSE but has not brought
the estimator closer to the true value of $\sigma^2$ in any sense.
The unbiased estimator $s^2$ seems to have better intrinsic properties
than $s_2^2$.
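The PN and PC values for $s^2$ and $s_2^2$ can be computed exactly rather than simulated. Since $s_2^2 < s^2$ always, $s_2^2$ is the nearer of the two exactly when $S^2/\sigma^2 \ge (n^2-1)/n$, so PN is an upper tail probability of a chi-square variable on $n-1$ degrees of freedom. The sketch below uses only the standard library; `chi2_sf`, `pn_shrunk` and `pc` are illustrative names, and the chi-square tail is built from the closed forms for 1 and 2 degrees of freedom plus a standard recurrence.

```python
import math

def chi2_sf(x, df):
    """Pr(chi-square with integer df exceeds x)."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        # Q(x; k+2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += math.exp((k / 2.0) * math.log(half) - half
                      - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

def pn_shrunk(df):
    """Pr(|s_2^2 - sigma^2| <= |s^2 - sigma^2|) with n = df + 1."""
    n = df + 1
    return chi2_sf((n * n - 1.0) / n, df)

def pc(df, a, divisor):
    """Pr(1/a <= S^2 / (divisor * sigma^2) <= a), S^2/sigma^2 ~ chi2(df)."""
    return chi2_sf(divisor / a, df) - chi2_sf(divisor * a, df)

if __name__ == "__main__":
    for df in (1, 2, 3, 4, 5, 10, 20, 40):
        n = df + 1
        print(df, round(pn_shrunk(df), 3),
              round(pc(df, 2.0, n - 1), 3), round(pc(df, 2.0, n + 1), 3))
```

Here `pc(df, a, n - 1)` corresponds to PC for $s^2$ and `pc(df, a, n + 1)` to PC$_2$ for $s_2^2$.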

It may be noted that the optimum shrinkage of $s^2$ depends
on the loss function chosen. If instead of the MSE, we choose
the MQE $= E(cs^2-\sigma^2)^4$ as the loss, then the optimum $c$ is a
solution of a cubic equation in $c$, obtained by setting the
derivative of the MQE with respect to $c$ equal to zero.


TABLE 3


Values of $E_2$, $B$, PN, PC and PC$_2$
for different degrees of freedom (DF)


   D.F.
   (n-1)     E2      B*      PN

     1      .577    .677    .221
     2      .707    .500    .264
     3      .774    .400    .290
     4      .816    .333    .308
     5      .845    .286    .323
     6      .866    .250    .334
     7      .882    .222    .346
     8      .894    .200    .352
     9      .904    .182    .359
    10      .912    .167    .365
    20      .953    .091    .400
    40      .976    .048    .428

PC (first row) and PC$_2$ (second row) for a = 1.5, a = 2.0, a = 2.5, a = 3
[entries not recovered]


*The shrinkage factor is $(1-B)$ where $B$ = Bias/$\sigma^2$


$$\mathrm{PC} = \Pr(-\log a \le \log s^2 - \log\sigma^2 \le \log a)$$

$$\mathrm{PC}_i = \Pr(-\log a \le \log s_i^2 - \log\sigma^2 \le \log a)$$

for $i = 1$ and 4. It is seen that $s^2$ performs better than $s_1^2$
and $s_4^2$ in terms of PN and PC. Among the estimators $s_1^2$, $s_2^2$ and
$s_4^2$, $s_1^2$ appears to be better than $s_2^2$ and $s_4^2$. The results are
not unexpected since the distribution of $s^2$ is skew on the
right and minimization of an expression of the type
$E(cs^2 - \sigma^2)^m$ pulls the estimator away from $\sigma^2$ in the region
around and below the modal value of $s^2$.

It is not clear why in statistical literature much
emphasis is laid on the estimation of $\sigma^2$ and not on $\sigma$, although
in practice the latter should be the parameter of direct
interest. Unfortunately, none of the properties such as




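unbiasedness and MMSE are preserved under transformations
of estimators and parameters. For instance, the minimum
variance unbiased estimator of $\sigma$ is

$$s^* = t\,s, \qquad t = \sqrt{\frac{n-1}{2}}\;\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \tag{3.3}$$

which is different from $s$, while the MMSE estimator of $\sigma$ is

$$\frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{2}\;\Gamma\left(\frac{n+1}{2}\right)}\,S \tag{3.4}$$

which is different from $s_2$. Now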

$$E(s^*-\sigma)^2 = \sigma^2(t^2-1) > 2\sigma^2\left(1-\frac{1}{t}\right) = E(s-\sigma)^2 \tag{3.5}$$

so that $s$ has a smaller MSE than $s^*$ as an estimator of $\sigma$.
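The inequality (3.5) can be checked numerically for any sample size. The sketch below takes $t$ to be the normal-theory constant for which $E(ts) = \sigma$, namely $t = \sqrt{(n-1)/2}\,\Gamma((n-1)/2)/\Gamma(n/2)$; this reading of $t$ is an assumption about the notation, and the function names are illustrative.

```python
import math

def t_factor(n):
    """t = sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2), so that E(t*s) = sigma
    for s^2 = S^2/(n-1) computed from n normal observations."""
    log_ratio = math.lgamma((n - 1) / 2.0) - math.lgamma(n / 2.0)
    return math.sqrt((n - 1) / 2.0) * math.exp(log_ratio)

def relative_mse(n):
    """E(s - sigma)^2 / E(s* - sigma)^2, using both sides of (3.5)."""
    t = t_factor(n)
    mse_unbiased = t * t - 1.0        # E(s* - sigma)^2 / sigma^2
    mse_s = 2.0 * (1.0 - 1.0 / t)     # E(s  - sigma)^2 / sigma^2
    return mse_s / mse_unbiased

if __name__ == "__main__":
    for n in (2, 3, 5, 11, 21, 41):
        print(n, round(t_factor(n), 4), round(math.sqrt(relative_mse(n)), 4))
```

The ratio simplifies to $2/\{t(t+1)\}$, which is below one whenever $t > 1$; the printed square root is on the same scale as the efficiency measures used elsewhere in the paper.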

We shall compare the relative performances of $s$ and $s^*$ as
estimators of $\sigma$ and of $s^2$ and $(s^*)^2$ as estimators of $\sigma^2$.
Table 5 gives the values of the following for different
degrees of freedom:

$$E_2 = [E(s-\sigma)^2/E(s^*-\sigma)^2]^{1/2},$$

$$\mathrm{PN}_1 = \Pr(|s^*-\sigma| \le |s-\sigma|),$$

$$\mathrm{PN}_2 = \Pr(|(s^*)^2-\sigma^2| \le |s^2-\sigma^2|),$$

$$\mathrm{PC} = \Pr(-\log a \le \log s^2 - \log\sigma^2 \le \log a),$$

$$\mathrm{PC}_2 = \Pr(-\log a \le \log(s^*)^2 - \log\sigma^2 \le \log a).$$
It is seen that although $E_2 \le 1$ uniformly for all $\sigma$ and DF,
so that $s$ has a smaller MSE than $s^*$ as an estimator of $\sigma$,
PN$_1$ is uniformly above 0.5 so that $s^*$ is nearer to $\sigma$ more
often than $s$. What is more interesting is that PN$_2$ is also
uniformly above 0.5, indicating that $(s^*)^2$ is nearer to $\sigma^2$
more often than $s^2$. Further, $\log(s^*)^2$ has greater concentration
around $\log\sigma^2$ than $\log s^2$ around $\log\sigma^2$ if the DF is not
small and the interval chosen is not short. It appears that
the biased estimator $(s^*)^2$ of $\sigma^2$ has better properties in
terms of PN and PC than $s^2$, although highly inadmissible with
respect to MSE.

4.  DIRECT  OR  INVERSE  REGRESSION 
Consider a pair of random variables $(\theta, Y)$ such that

$$Y = \theta + \varepsilon, \quad E(\varepsilon) = 0, \quad \mathrm{cov}(\theta,\varepsilon) = 0, \quad V(\varepsilon) = \sigma_0^2. \tag{4.1}$$

In practice $\theta$ stands for the true value of a quantity (such
as the cholesterol level of a blood sample) and $Y$ is a measurement
of $\theta$ subject to error. Only $Y$ is observable and not $\theta$,
in which case the problem is one of estimating or predicting
$\theta$ given $Y$.

[TABLE 5: Values of $E_2$, PN$_1$, PN$_2$, PC and PC$_2$ for various de...; table body not recovered]

From (4.1), the regression of $Y$ on $\theta$ is $\theta$ itself, so that
the inverse regression estimate of $\theta$ is $Y$, which is also an
unbiased estimator of $\theta$. On the other hand, if the mean ($\mu$)
and variance ($\sigma_\theta^2$) of the unconditional distribution of $\theta$ are
known, then the regression of $\theta$ on $Y$ is

$$\hat\theta = \mu + \frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma_0^2}\,(Y-\mu) \tag{4.2}$$

which provides a direct regression estimate of $\theta$. In practice,
the estimation procedure (4.2) can be implemented by estimating
$\mu$, $\sigma_\theta^2$ and $\sigma_0^2$ from past data on $Y$ (cholesterol determinations)
on a large number of individuals (see Rao, 1973, p. 337),
and updating the estimates as more data accumulate. The estimator
$\hat\theta$ can be identified as the Bayes estimator using a
quadratic loss function and a relevant prior distribution
for $\theta$.

Suppose that an individual's blood sample has been referred
to a clinic for the determination of cholesterol and the clinic
reports the measurement as $Y$. What should we record as the
estimate of blood cholesterol for the individual: the unbiased
estimator $Y$ or the Bayes estimator $\hat\theta$ of (4.2) using a relevant
estimated prior distribution? There has been considerable
controversy on this subject, in a slightly different context, in
the calibration problem (see Berkson, 1969; Halperin, 1970;
Krutchkoff, 1967, 1969, 1971 and Williams, 1969). We shall
examine this problem in the set up of (4.1) assuming that the
parameters $\mu$, $\sigma_\theta^2$ of the prior distribution and the variance
$\sigma_0^2$ of the error of measurement are known. Now

$$E(\hat\theta-\theta)^2 = \frac{\sigma_\theta^2\sigma_0^2}{\sigma_\theta^2+\sigma_0^2} \le \sigma_0^2 = E(Y-\theta)^2 \tag{4.3}$$
and the strict inequality holds if $\sigma_0 \ne 0$, so that the mean
square error of prediction is smaller for $\hat\theta$. Does this
mean that $\hat\theta$ is closer to $\theta$ than $Y$ in some sense? To examine
this question we have to consider the distributions of $Y$ and
$\hat\theta$ for given $\theta$.

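The MSEs of $Y$ and $\hat\theta$ for given $\theta$ are

$$E[(Y-\theta)^2\,|\,\theta] = \sigma_0^2 \tag{4.4}$$

$$E[(\hat\theta-\theta)^2\,|\,\theta] = \sigma_0^2\delta^2(\delta^2+\lambda^2)/(1+\delta^2)^2 \tag{4.5}$$

where $(\theta-\mu)/\sigma_\theta = \lambda$ and $\delta = \sigma_\theta/\sigma_0$. From (4.4) and (4.5),

$$E[(\hat\theta-\theta)^2\,|\,\theta] \le E[(Y-\theta)^2\,|\,\theta] \tag{4.6}$$

iff $\lambda^2 \le (1+2\delta^2)/\delta^2$. Thus the efficiency of $\hat\theta$ compared to
$Y$ with respect to MSE depends on the magnitude of the deviation
of the true value of $\theta$ from the a priori mean. If the
deviation is large, $\hat\theta$ is less efficient than $Y$.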

The estimator $Y$ is unbiased while the bias in $\hat\theta$ is

$$E[(\hat\theta-\theta)\,|\,\theta] = -\lambda\sigma_\theta\sigma_0^2/(\sigma_\theta^2+\sigma_0^2) \tag{4.7}$$

so that large values of $\theta$ are under-estimated and small values
are over-estimated.
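The conditional comparison of $\hat\theta$ and $Y$ can be tabulated directly from (4.5) and (4.7). The sketch below additionally assumes that the measurement error $\varepsilon$ is normal, which is needed only for the PN calculation and is not stated in (4.1); under that assumption $|\hat\theta-\theta| \le |Y-\theta|$ holds exactly when $\varepsilon/\sigma_0 \ge |\lambda|\delta/(1+2\delta^2)$ or $\varepsilon/\sigma_0 \le -|\lambda|\delta$. The function names are illustrative.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def direct_vs_inverse(lam, delta):
    """Return (E2, bias / sigma_0, PN) of the direct regression estimate
    against Y, for lam = (theta - mu) / sigma_theta and
    delta = sigma_theta / sigma_0. PN assumes a normal error."""
    e2 = math.sqrt(delta ** 2 * (delta ** 2 + lam ** 2)) / (1.0 + delta ** 2)
    bias = -lam * delta / (1.0 + delta ** 2)
    l = abs(lam)
    pn = phi(-l * delta) + 1.0 - phi(l * delta / (1.0 + 2.0 * delta ** 2))
    return e2, bias, pn

if __name__ == "__main__":
    for delta in (0.5, 1.0, 2.0):
        for lam in (0.0, 1.0, 2.0, 3.0):
            e2, bias, pn = direct_vs_inverse(lam, delta)
            print(f"delta={delta} lambda={lam} "
                  f"E2={e2:.3f} bias={bias:.3f} PN={pn:.3f}")
```

At $\lambda = 0$ the direct estimate is always the nearer one, and both criteria deteriorate as $|\lambda|$ grows, with PN falling below 0.5 before $E_2$ exceeds one.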

Table 6 gives for different combinations of $\delta$ and $\lambda$ the
values of

$$E_2 = [E\{(\hat\theta-\theta)^2\,|\,\theta\}/E\{(Y-\theta)^2\,|\,\theta\}]^{1/2},$$

$$\mathrm{PN} = \Pr(|\hat\theta-\theta| \le |Y-\theta|),$$

where the regions for which (i) $E_2 < 1$, PN $> 0.5$, (ii) $E_2 < 1$,
PN $< 0.5$ and (iii) $E_2 > 1$, PN $< 0.5$ are marked. It is seen that
$\hat\theta$ performs better than $Y$ when the error of measurement is
large and the true value is near the mean of the a priori
distribution. But if precise estimation of large deviations from
the a priori mean is more important (as it should be in a
problem like the estimation of blood cholesterol), $Y$ should be
preferred to $\hat\theta$.


TABLE 6


Values of $E_2$ (first entry) and PN (second entry)
for different combinations of $\lambda$ and $\delta$



5.  SIMULTANEOUS  ESTIMATION  OF  TWO  PARAMETERS 

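Let $X_1 \sim N(\theta_1,\sigma^2)$, $X_2 \sim N(\theta_2,\sigma^2)$ and $fs^2 \sim \sigma^2\chi^2(f)$ be
independent random variables, and consider the following
estimators of $\theta_1, \theta_2$

$$t_1 = \frac{X_1+X_2}{2} + c\,\frac{X_1-X_2}{2} \tag{5.1}$$

$$t_2 = \frac{X_1+X_2}{2} + c\,\frac{X_2-X_1}{2} \tag{5.2}$$

as alternatives to the unbiased estimators $X_1$ and $X_2$.
Then

$$E(t_1-\theta_1)^2 = E(t_2-\theta_2)^2 = \sigma^2\left[\frac{1+c^2}{2} + \frac{(1-c)^2\delta^2}{4}\right] \tag{5.3}$$

and the expected compound quadratic loss (ECQL) is

$$E[(t_1-\theta_1)^2 + (t_2-\theta_2)^2] = \sigma^2\left[1 + c^2 + \frac{(1-c)^2\delta^2}{2}\right] \tag{5.4}$$

where $\delta = (\theta_1-\theta_2)/\sigma$. The expression (5.4) attains the minimum
when $c = \delta^2/(2+\delta^2)$. Since $\delta$ is not known, we may consider
the empirical versions of (5.1) and (5.2)

$$t_1^{(e)} = \frac{X_1+X_2}{2} + \frac{(X_1-X_2)^2/s^2}{2+(X_1-X_2)^2/s^2}\cdot\frac{X_1-X_2}{2}$$

$$t_2^{(e)} = \frac{X_1+X_2}{2} + \frac{(X_1-X_2)^2/s^2}{2+(X_1-X_2)^2/s^2}\cdot\frac{X_2-X_1}{2}$$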
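We shall compare $t_1^{(e)}$ and $t_2^{(e)}$ with $X_1$ and $X_2$, assuming
that $\sigma$ is known, with respect to the following criteria:

$$B_1 = \sigma^{-1}E(t_1^{(e)}-\theta_1), \qquad B_2 = \sigma^{-1}E(t_2^{(e)}-\theta_2),$$

$$\mathrm{PN} = \tfrac12\left[\Pr(|t_1^{(e)}-\theta_1| \le |X_1-\theta_1|) + \Pr(|t_2^{(e)}-\theta_2| \le |X_2-\theta_2|)\right],$$

$$E = (E_1 + E_2 + E_4)/3,$$

where

$$E_1 = \sigma^{-1}\sqrt{\pi/2}\left(\tfrac12 E|t_1^{(e)}-\theta_1| + \tfrac12 E|t_2^{(e)}-\theta_2|\right),$$

$$E_2 = \sigma^{-1}\left[\tfrac12 E(t_1^{(e)}-\theta_1)^2 + \tfrac12 E(t_2^{(e)}-\theta_2)^2\right]^{1/2},$$

$$E_4 = \sigma^{-1}\left[\tfrac16 E(t_1^{(e)}-\theta_1)^4 + \tfrac16 E(t_2^{(e)}-\theta_2)^4\right]^{1/4}.$$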

Table 7 gives the values of $E_1$, $E_2$, $E_4$, $E$, PN and $B_1$, $B_2$
based on a simulation study using 1000 samples, for various
values of $\delta = (\theta_1-\theta_2)/\sigma$. It is seen that simultaneous
estimation of $\theta_1$, $\theta_2$ by $t_1^{(e)}$ and $t_2^{(e)}$ has some advantage
over $X_1$ and $X_2$ when $\delta \le 2$ (approximately), i.e., when the
parameters under estimation do not differ by more than twice
the standard error of the estimator of a single parameter.
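A simulation of the kind summarized in Table 7 can be sketched as follows. The code assumes $\sigma = 1$ is known, takes $\theta_2 = 0$ and $\theta_1 = \delta$ without loss of generality, and reports only $E_2$ and PN; the sample size, seed and function name are illustrative and differ from the 1000-sample study described above, so agreement with Table 7 is only up to simulation error.

```python
import math
import random

def simulate_pair(delta, n_samples=20000, seed=0):
    """Monte Carlo E2 and PN for the empirical estimators t1, t2
    with sigma = 1 known, theta1 = delta and theta2 = 0."""
    rng = random.Random(seed)
    theta1, theta2 = delta, 0.0
    sum_sq = 0.0
    closer = 0
    for _ in range(n_samples):
        x1 = rng.gauss(theta1, 1.0)
        x2 = rng.gauss(theta2, 1.0)
        mean = (x1 + x2) / 2.0
        d2 = (x1 - x2) ** 2
        w = d2 / (2.0 + d2)              # empirical shrinkage factor
        t1 = mean + w * (x1 - x2) / 2.0
        t2 = mean + w * (x2 - x1) / 2.0
        sum_sq += 0.5 * ((t1 - theta1) ** 2 + (t2 - theta2) ** 2)
        closer += abs(t1 - theta1) <= abs(x1 - theta1)
        closer += abs(t2 - theta2) <= abs(x2 - theta2)
    return math.sqrt(sum_sq / n_samples), closer / (2.0 * n_samples)

if __name__ == "__main__":
    for delta in (0.0, 1.0, 2.0, 3.0, 4.0):
        e2, pn = simulate_pair(delta)
        print(f"delta={delta}  E2={e2:.3f}  PN={pn:.3f}")
```

Because the sum $X_1+X_2$ is left untouched and only the difference is shrunk, the behaviour of $E_2$ in $\delta$ mirrors that of the single-parameter estimator (2.5) applied to $(X_1-X_2)/\sqrt{2}$.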

TABLE 7


Values of $E_1$, $E_2$, $E_4$, $E$, $B_1$, $B_2$ and PN for
$t_1^{(e)}$ and $t_2^{(e)}$ for different values of $\delta = (\theta_1-\theta_2)/\sigma$


    δ       B1      B2      PN      E1      E2      E4      E

    0     -.027    .005    .702    .854    .864    .872    .863
    .5     .052   -.132    .674    .871    .871    .878    .873
   1.0     .121   -.164    .635    .894    .887    .883    .888
   1.5     .216   -.145    .567    .962    .956    .954    .957
   2.0     .257   -.259    .517   1.013    .998    .976    .996
   2.5     .269   -.277    .485   1.041   1.029   1.000   1.023
   3.0     .199   -.243    .475   1.046   1.045   1.041   1.044
   3.5     .279   -.234    .455   1.081   1.064   1.037   1.061
   4.0     .229   -.251    .442   1.087   1.083   1.080   1.083
   5.0     .188   -.132    .454   1.037   1.034   1.023   1.031
   6.0     .207   -.181    .465   1.027   1.029   1.025   1.027
   7.0     .197   -.118    .468   1.042   1.040   1.049   1.044
   8.0     .085   -.102    .491   1.022   1.018   1.010   1.017
   9.0     .070   -.108    .490   1.015   1.024   1.029   1.023
  10.0     .097   -.105    .485   1.006   1.009   1.025   1.013


6. ESTIMATION OF SEVERAL PARAMETERS

Let $X_i \sim N(\theta_i,\sigma^2)$, $i = 1,\ldots,p$ and $fs^2 \sim \sigma^2\chi^2(f)$ be
independent random variables, where $(\theta_1,\ldots,\theta_p) = \theta'$ is a
fixed vector parameter. James and Stein (1961) have found the
remarkable result that when $p \ge 3$ there exist statistics

$$T_i = T_i(X_1,\ldots,X_p,s^2), \quad i = 1,\ldots,p \tag{6.1}$$

such that

$$E\left[\sum(T_i-\theta_i)^2\right] < E\left[\sum(X_i-\theta_i)^2\right] \tag{6.2}$$

uniformly for all $\theta_i$, which implies that $X' = (X_1,\ldots,X_p)$ as an
estimator of $\theta$ is inadmissible with respect to the CQL
(compound quadratic loss) function. The result (6.2)
gives the impression that we stand to gain by answering
several problems, possibly unrelated, simultaneously.

It is well known that there do not exist statistics $t_i$
alternative to $X_i$ such that

$$E(t_i-\theta_i)^2 < E(X_i-\theta_i)^2, \quad i = 1,\ldots,p \tag{6.3}$$

uniformly for all $\theta_i$, so that the overall reduction in the
ECQL possibly takes place by an increase in the MSE for
some parameters and a decrease to a larger extent for the
others. To examine this phenomenon in some detail, we
shall consider a number of alternative joint estimators of
$\theta_1,\ldots,\theta_p$ of the type suggested by James and Stein and study
the performance of individual estimators.

Specifically, we consider the following types of
estimators of $\theta_1,\ldots,\theta_p$:

$$T_{1i} = bX_i, \quad i = 1,\ldots,p, \tag{6.4}$$

$$T_{2i} = a + b(X_i-a), \quad i = 1,\ldots,p, \tag{6.5}$$

$$T_{3i} = a + b_i(X_i-a), \quad i = 1,\ldots,p, \tag{6.6}$$

which may be represented by $T_1$, $T_2$ and $T_3$ in vector
notation.
Now

$$E\left[\sum(T_{1i}-\theta_i)^2\right] = pb^2\sigma^2 + (1-b)^2\sum\theta_i^2 \tag{6.7}$$

which attains the minimum value at $b = \nu^2/(1+\nu^2)$ where
$\nu^2 = \sum\theta_i^2/p\sigma^2$. If $\nu$ is known, then the optimum estimator
of the type (6.4) is

$$T_{1i} = \frac{\nu^2}{1+\nu^2}\,X_i \tag{6.8}$$

and the ECQL is

$$E\left[\sum(T_{1i}-\theta_i)^2\right] = p\sigma^2\frac{\nu^2}{1+\nu^2} < p\sigma^2 = E\left[\sum(X_i-\theta_i)^2\right]. \tag{6.9}$$

The MSE for an individual estimator is

$$E(T_{1i}-\theta_i)^2 = \sigma^2(\nu^4+\nu_i^2)/(1+\nu^2)^2 \tag{6.10}$$

where $\nu_i = \theta_i/\sigma$. The expression exceeds the MSE of $X_i$ if
$\nu_i^2 > 2\nu^2+1$, indicating the possibility that in joint estimation
of the $T_1$-type, the larger parameters are less efficiently
estimated than the smaller ones.

If $\nu^2$ is not known, we may estimate $1/(1+\nu^2)$ by
$cs^2/\sum X_i^2$, where $c$ is a suitable constant, and obtain an
empirical version of (6.8)

$$T_{1i}^{(e)} = \left(1 - \frac{cs^2}{\sum X_i^2}\right)X_i. \tag{6.11}$$

The best choice of $c$ obtained by minimizing the ECQL of
(6.11) is $f(p-2)/(f+2)$ if $p \ge 3$, which leads to the James-Stein
estimator

$$T_{1i}^{(e)} = \left[1 - \frac{f(p-2)}{f+2}\,\frac{s^2}{\sum X_i^2}\right]X_i. \tag{6.12}$$

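James and Stein have shown that for $p \ge 3$

$$E[(T_1^{(e)}-\theta)'(T_1^{(e)}-\theta)] = p\sigma^2 - \frac{f(p-2)^2\sigma^2}{f+2}\,E[(2K_1+p-2)^{-1}] \tag{6.13}$$

where $K_1$ is a Poisson variable with parameter $p\nu^2/2$. The
expression (6.13) is smaller than $p\sigma^2$ for all $\nu$. Ullah
(1974) computed the bias and MSE of individual estimators:

$$\sigma^{-1}E(T_{1i}^{(e)}-\theta_i) = -\frac{f(p-2)}{f+2}\,\nu_i\,E[(2K_1+p)^{-1}] \tag{6.14}$$

$$\sigma^{-2}E(T_{1i}^{(e)}-\theta_i)^2 = 1 - \frac{f(p-2)^2}{2(f+2)}\left[(1+\kappa)E\{(2K_1+p-2)^{-1}\} - \left(1+\frac{p\kappa}{p-2}\right)E\{(2K_1+p)^{-1}\}\right] \tag{6.15}$$

where

$$\kappa = \left[(p+2)\theta_i^2 - 2\sum\theta_j^2\right]\Big/\sum\theta_j^2.$$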
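The individual mean square errors of the James-Stein estimator can be evaluated by summing the Poisson series. The sketch below treats the case of known $\sigma = 1$, i.e., with the factor $f/(f+2)$ replaced by one, and takes the Poisson mean to be $\sum\theta_j^2/2$; both the known-variance simplification and the function names are assumptions made for illustration. The individual values should add up to the compound risk.

```python
import math

def poisson_expect(g, lam, terms=500):
    """E[g(K)] for K ~ Poisson(lam), by direct summation of the series."""
    if lam == 0.0:
        return g(0)
    total, pk = 0.0, math.exp(-lam)
    for k in range(terms):
        total += pk * g(k)
        pk *= lam / (k + 1)
    return total

def compound_risk(theta):
    """E sum (T_i - theta_i)^2 for the James-Stein estimator, sigma = 1."""
    p = len(theta)
    lam = sum(t * t for t in theta) / 2.0
    a = poisson_expect(lambda k: 1.0 / (2 * k + p - 2), lam)
    return p - (p - 2) ** 2 * a

def individual_mse(theta):
    """E (T_i - theta_i)^2 for each i, sigma = 1; needs sum(theta^2) > 0."""
    p = len(theta)
    total = sum(t * t for t in theta)
    lam = total / 2.0
    a = poisson_expect(lambda k: 1.0 / (2 * k + p - 2), lam)
    b = poisson_expect(lambda k: 1.0 / (2 * k + p), lam)
    out = []
    for t in theta:
        kappa = ((p + 2) * t * t - 2.0 * total) / total
        out.append(1.0 - 0.5 * (p - 2) ** 2
                   * ((1.0 + kappa) * a - (1.0 + p * kappa / (p - 2)) * b))
    return out

if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0, 3.0]
    print([round(m, 4) for m in individual_mse(theta)],
          round(compound_risk(theta), 4))
```

In the example the three zero parameters are estimated with MSE below one, while the single large parameter has MSE above one, even though the total is below $p$.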


Similarly, the best estimator of the type (6.5) is

$$T_{2i} = \bar\theta + \left[1 - \frac{1}{1+\eta^2}\right](X_i-\bar\theta), \quad i = 1,\ldots,p \tag{6.16}$$

where $\bar\theta = (\theta_1+\cdots+\theta_p)/p$ and $\eta^2 = \sum(\theta_i-\bar\theta)^2/p\sigma^2$. The ECQL
for $T_2$ as in (6.16) is

$$p\sigma^2\frac{\eta^2}{1+\eta^2} \le p\sigma^2\frac{\nu^2}{1+\nu^2} \le p\sigma^2 \tag{6.17}$$

so that if $\bar\theta$ and $\eta^2$ are known further improvement in ECQL
over $X$ and $T_1$ is possible. The MSE for an individual
estimator is

$$E(T_{2i}-\theta_i)^2 = \sigma^2(\eta^4+\eta_i^2)/(1+\eta^2)^2 \tag{6.18}$$

where $\eta_i^2 = (\theta_i-\bar\theta)^2/\sigma^2$. The expression (6.18) exceeds $\sigma^2$,
the MSE of $X_i$, if $\eta_i^2 > 2\eta^2+1$, indicating the possibility that
in joint estimation of the type $T_2$, the extreme parameters
are less efficiently estimated and the middle parameters more
efficiently than the corresponding unbiased estimators.

As in the case of $T_1$, we can estimate $\bar\theta$ and $1/(1+\eta^2)$ by
$\bar X$ and $(p-3)fs^2/\{(f+2)\sum(X_i-\bar X)^2\}$ respectively if $p \ge 4$ and
obtain an empirical version of $T_2$

$$T_{2i}^{(e)} = \bar X + \left[1 - \frac{(p-3)fs^2}{(f+2)\sum(X_j-\bar X)^2}\right](X_i-\bar X), \quad i = 1,\ldots,p\ (\ge 4). \tag{6.19}$$

It can be shown (Efron and Morris, 1971, 1972a; Rao, 1976a)
that

$$E\{(T_2^{(e)}-\theta)'(T_2^{(e)}-\theta)\} = p\sigma^2 - \frac{f(p-3)^2\sigma^2}{f+2}\,E\{(2K_2+p-3)^{-1}\} \tag{6.20}$$

where $K_2$ is a Poisson variable with parameter $p\eta^2/2$. The
expression (6.20) is less than $p\sigma^2$ so that $T_2^{(e)}$ is uniformly
better than $X$ with respect to ECQL. Rao and Shinozaki (1978)
have shown that for individual estimators
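$$E(T_{2i}^{(e)}-\theta_i) = -\frac{f(p-3)}{f+2}\,(\theta_i-\bar\theta)\,E\{(2K_2+p-1)^{-1}\} \tag{6.21}$$

$$E(T_{2i}^{(e)}-\theta_i)^2 = \sigma^2 - \frac{f(p-3)^2\sigma^2}{2(f+2)}\left[(c+d)E\{(2K_2+p-3)^{-1}\} - \left(c+\frac{(p-1)d}{p-3}\right)E\{(2K_2+p-1)^{-1}\}\right] \tag{6.22}$$

where

$$c = (p-1)/p, \qquad d = \left\{(p+1)(\theta_i-\bar\theta)^2 - \frac{2(p-1)}{p}\sum(\theta_j-\bar\theta)^2\right\}\Big/\sum(\theta_j-\bar\theta)^2.$$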

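Finally, the best estimator of the type (6.6) is

$$T_{3i} = \bar\theta + \frac{(\theta_i-\bar\theta)^2}{(\theta_i-\bar\theta)^2+\sigma^2}\,(X_i-\bar\theta), \quad i = 1,\ldots,p \tag{6.23}$$

and its empirical version is

$$T_{3i}^{(e)} = \bar X + \frac{(X_i-\bar X)^2}{(X_i-\bar X)^2+s^2}\,(X_i-\bar X), \quad i = 1,\ldots,p. \tag{6.24}$$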

It is difficult to compute the ECQL of $T_3^{(e)}$ or the MSE of
$T_{3i}^{(e)}$.

The relative performance of the estimators $T_1^{(e)}$ and
$T_2^{(e)}$ and the ranges of parametric values for which the
individual $T_1^{(e)}$ and $T_2^{(e)}$ estimators are better than the
$X$-estimators are examined in Rao and Shinozaki (1978).
Table 8 contains the results of simulation studies based on
a thousand samples for the estimation of four parameters
$\theta_1 = a\sigma$, $\theta_2 = (a+d)\sigma$, $\theta_3 = (a+2d)\sigma$ and $\theta_4 = (a+3d)\sigma$ for various
combinations of $a$ and $d$, assuming $\sigma$ to be known, i.e., by
replacing $fs^2/(f+2)$ by $\sigma^2$ in the formulae (6.12) and (6.19) for
$T_1^{(e)}$ and $T_2^{(e)}$. The broad conclusions remain the same if the
random variable $s^2$ is used provided $f$, the degrees of freedom,
is not small.
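A simulation along the lines described for Table 8 can be sketched as follows, with $\sigma = 1$ known so that $fs^2/(f+2)$ and $s^2$ are both replaced by one in (6.12), (6.19) and (6.24). The sample size, the seed and the function name are illustrative; for each estimator and each of the four parameters the function returns the root relative MSE, the PN value and the bias.

```python
import math
import random

def simulate_four(a, d, n_samples=20000, seed=0):
    """Monte Carlo (E, PN, B) for T1, T2, T3 with sigma = 1 known and
    theta = (a, a+d, a+2d, a+3d)."""
    rng = random.Random(seed)
    p = 4
    theta = [a + k * d for k in range(p)]
    names = ("T1", "T2", "T3")
    sq = {m: [0.0] * p for m in names}
    bias = {m: [0.0] * p for m in names}
    closer = {m: [0] * p for m in names}
    for _ in range(n_samples):
        x = [rng.gauss(t, 1.0) for t in theta]
        xbar = sum(x) / p
        ss0 = sum(v * v for v in x)              # sum of squares about 0
        ss = sum((v - xbar) ** 2 for v in x)     # sum of squares about mean
        est = {
            "T1": [(1.0 - (p - 2) / ss0) * v for v in x],
            "T2": [xbar + (1.0 - (p - 3) / ss) * (v - xbar) for v in x],
            "T3": [xbar + (v - xbar) ** 3 / ((v - xbar) ** 2 + 1.0)
                   for v in x],
        }
        for m in names:
            for i in range(p):
                err = est[m][i] - theta[i]
                sq[m][i] += err * err
                bias[m][i] += err
                closer[m][i] += abs(err) <= abs(x[i] - theta[i])
    n = float(n_samples)
    # E(X_i - theta_i)^2 = 1, so the root MSE is already relative to X_i
    return {m: [(math.sqrt(sq[m][i] / n), closer[m][i] / n, bias[m][i] / n)
                for i in range(p)] for m in names}

if __name__ == "__main__":
    for name, rows in simulate_four(0.0, 1.0).items():
        print(name, [tuple(round(v, 3) for v in row) for row in rows])
```

With $p = 4$ the squared errors of $T_1^{(e)}$ and $T_2^{(e)}$ have heavy tails, so the estimated $E$ values converge slowly; PN and the sign of the bias are more stable summaries in a short run.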

For each combination of $\bar\theta/\sigma$ and $d$, Table 8 gives the
values of $E$, PN and $B$ in the first row for $T_1^{(e)}$, in the second
row for $T_2^{(e)}$ and in the third row for $T_3^{(e)}$, where for any
statistic $t_i$

$$E = [E(t_i-\theta_i)^2 \div E(X_i-\theta_i)^2]^{1/2},$$

$$\mathrm{PN} = \Pr(|t_i-\theta_i| \le |X_i-\theta_i|),$$

$$B = E(t_i-\theta_i).$$

On the basis of previous investigations by Efron and
Morris (1971, 1972a,b, 1973a,b), Rao (1975a, 1975b, 1975c, 1977)
and Rao and Shinozaki (1978) and the present simulation
studies, the following broad conclusions emerge.

(a) There is some advantage in using $T_2^{(e)}$ and $T_3^{(e)}$ when
the range of parameter values is small, and $T_1^{(e)}$ when both
the range and values of the parameters are small compared
to the standard error of the unbiased estimators.

(b) When the range of parameters is large both $T_1^{(e)}$ and
$T_2^{(e)}$ tend to have the same properties as $X$. But the
performance of $T_3^{(e)}$ tends to be erratic.

(c) When the range of parameters is moderate, $T_1^{(e)}$
gives higher precision for the parameters with smaller
values at the expense of lower precision for higher values.
According to the PN criterion, only the smallest of the four
parameters is better estimated than by the corresponding
unbiased estimator. In the case of $T_2^{(e)}$ and $T_3^{(e)}$, extreme
values of the parameters suffer at the expense of increased
precision for the middle values.

(d) One drawback in using the estimators $T_1^{(e)}$, $T_2^{(e)}$ and
$T_3^{(e)}$ in preference to $X$ is the bias in these estimators.
The bias is of a substantial magnitude for the higher values
in the case of $T_1^{(e)}$ and for the extreme values of the
parameters in the case of $T_2^{(e)}$ and $T_3^{(e)}$. There are situations
where bias in the estimators may have serious consequences
such as the following.

Suppose there are four regions and periodical estimates
of a certain characteristic are needed for sharing some
resources in proportion to the values of the characteristic of
the four regions. If, each time, estimates of the type $T_1^{(e)}$,
$T_2^{(e)}$ and $T_3^{(e)}$ are used, some regions stand to gain at the
expense of the others in the long run (Rao, 1977, 1979).
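The long-run effect described above can be illustrated by a small simulation. The sketch below is not the simulation study reported in the tables. It assumes a known error variance, hypothetical parameter values (10, 11, 12, 13) for the four regions, and a positive-part estimator that shrinks toward the sample mean as a stand-in for the shrunken estimators. All function names are illustrative.

```python
import random


def shrink_to_mean(x, phi):
    """Positive-part shrinkage of x toward its mean (assumed known variance phi)."""
    p = len(x)
    xbar = sum(x) / p
    s = sum((xi - xbar) ** 2 for xi in x)
    # Shrinkage factor, truncated at zero (the truncation is an assumption
    # made here to keep the factor bounded).
    factor = max(0.0, 1.0 - (p - 3) * phi / s) if s > 0 else 0.0
    return [xbar + factor * (xi - xbar) for xi in x]


def mean_shares(theta, phi, n_rounds, seed=0):
    """Average share of each region over repeated rounds of estimation."""
    rng = random.Random(seed)
    p = len(theta)
    sd = phi ** 0.5
    raw = [0.0] * p      # shares based on the unbiased estimates X
    shrunk = [0.0] * p   # shares based on the shrunken estimates
    for _ in range(n_rounds):
        x = [rng.gauss(t, sd) for t in theta]
        t_hat = shrink_to_mean(x, phi)
        total_x, total_t = sum(x), sum(t_hat)
        for i in range(p):
            raw[i] += x[i] / total_x / n_rounds
            shrunk[i] += t_hat[i] / total_t / n_rounds
    return raw, shrunk


theta = [10.0, 11.0, 12.0, 13.0]          # hypothetical true values
true_shares = [t / sum(theta) for t in theta]
raw, shrunk = mean_shares(theta, phi=1.0, n_rounds=20000)
for i in range(len(theta)):
    print(i + 1, round(true_shares[i], 4), round(raw[i], 4), round(shrunk[i], 4))
```

Under these assumptions the average share of the region with the largest value falls below its true share and that of the region with the smallest value rises above it, while the shares computed from X stay close to the true ones.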

In some situations, the individual parameters may not be
of direct interest but certain linear combinations may be
important. If $c'\theta = c_1\theta_1 + \ldots + c_p\theta_p$ is a linear
combination to be estimated, should one estimate it by $c'X$ or
$c'T_1^{(e)}$ or $c'T_2^{(e)}$ or $c'T_3^{(e)}$? Naturally, the answer
depends on the vector c (Rao, 1975b, 1975c), and the optimal
properties of $T_1^{(e)}$, $T_2^{(e)}$ and $T_3^{(e)}$ with respect to a single
criterion like the CQL do not ensure their efficiency in the
different ways they may be used for practical purposes. If
multiple uses are intended, the best plan is to place on record X
as the estimator of $\theta$ (together with an estimate of $\sigma^2$),
leaving it to the user to make any optimal adjustments in X
depending on the particular problems under study.

Note 1. It may be of historical interest to note that
estimators of the type $T_2^{(e)}$ have been constructed under more
general conditions, in multivariate analysis, for purposes of
genetic selection by Fairfield Smith (1936), Hazel (1943)
and Rao (1953) based on an idea suggested by Fisher. The
problem was as follows. Let $(\theta, y_1, \ldots, y_m)$ be (m+1) vector
variables representing the unobservable genetic values $\theta$
and repeated independent phenotypic vector measurements
$y_1, \ldots, y_m$ on an individual. The variables are related by
the model

$$y_i = \theta + \varepsilon_i, \quad i = 1, \ldots, m. \tag{6.24}$$

The genetic worth of an individual is measured by a linear
function $g'\theta$. Suppose that we have observed p individuals
from a population, with phenotypic measurements

$$(y_{1j}, \ldots, y_{mj}), \quad j = 1, \ldots, p. \tag{6.25}$$


What is the best way of estimating the genetic worths
$g'\theta_1, \ldots, g'\theta_p$ of these individuals for purposes of
ranking and selecting a given proportion of the individuals with
the largest genetic values? If $\bar y_j$ represents the mean of the
measurements (6.25) for individual j, then $g'\bar y_j$ is an
unbiased estimator of $g'\theta_j$. However, Fisher suggested the
regression of $g'\theta_j$ on $\bar y_j$ as the appropriate selection
index, which involves the knowledge of $\mu = E(\theta)$,
$\Gamma = \mathrm{cov}(\theta, \theta)$ and $\Lambda = \mathrm{cov}(\bar\varepsilon, \bar\varepsilon)$, where
$\bar\varepsilon = m^{-1} \sum_i \varepsilon_i$. The regression estimator of $g'\theta_j$
is $g'\hat\theta_j$ where

$$\hat\theta_j = \mu + [I - \Lambda(\Gamma + \Lambda)^{-1}](\bar y_j - \mu). \tag{6.26}$$

By multivariate analysis of variance and covariance of the
data (6.25), we obtain dispersion matrices B and W as between
and within individuals with degrees of freedom (p-1) and
f = p(m-1) respectively, which supply estimates B/(p-1) of
$(\Gamma + \Lambda)$ and W/mf of $\Lambda$. Then an empirical version of
(6.26) is

$$\hat\theta_j = \bar y + \left[I - \frac{p-1}{mf}\, W B^{-1}\right](\bar y_j - \bar y) \tag{6.27}$$

where $p\bar y = \sum_j \bar y_j$. The details leading to the formula
(6.27) are given in Rao (1953, pp. 237-8). When all the variables
are one dimensional, (6.27) is the same as $T_2^{(e)}$ of (6.19) except
that the multiplying factor (p-1)/f is replaced by (p-3)/(f+2).
In the 1953 paper, Rao also considered some distributional
problems for testing hypotheses concerning the rank of the $\Gamma$
matrix and the efficiency of the regression estimator.
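For one-dimensional measurements, the substitution of the estimates B/(p-1) and W/mf into (6.26) can be sketched as follows. This is only an illustration under stated assumptions: the data are made up, the function name is hypothetical, and f is taken as p(m-1), the within-individual degrees of freedom for p individuals with m measurements each.

```python
def empirical_regression_estimates(y):
    """Scalar sketch of an empirical regression (selection index) estimator.

    y[j] is the list of m repeated measurements on individual j.
    Returns the shrunken estimate of the genetic value of each individual.
    """
    p, m = len(y), len(y[0])
    means = [sum(row) / m for row in y]          # individual means
    grand = sum(means) / p                       # overall mean, estimates mu
    # Between-individual sum of squares of the means, p - 1 d.f.
    b = sum((mj - grand) ** 2 for mj in means)
    # Within-individual sum of squares, f d.f.
    w = sum((v - mj) ** 2 for row, mj in zip(y, means) for v in row)
    f = p * (m - 1)                              # assumption: balanced data
    lam_hat = w / (m * f)                        # estimate of Lambda
    total_hat = b / (p - 1)                      # estimate of Gamma + Lambda
    shrink = 1.0 - lam_hat / total_hat
    return [grand + shrink * (mj - grand) for mj in means]


data = [[4.0, 6.0], [9.0, 11.0], [14.0, 16.0]]   # made-up measurements
estimates = empirical_regression_estimates(data)
print(estimates)
```

With these made-up data the individual means 5, 10 and 15 are pulled toward the overall mean 10, giving approximately 5.2, 10 and 14.8, and the ranking of the individuals is unchanged.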

It should be noted that the regression estimators (or
empirical Bayes estimators) $g'\hat\theta_j$ are appropriate in the
problem of selection where the total genetic worth of the selected
subset of individuals has to be maximized. In such a case,
it is well known that the best ordering of the observed
individuals is achieved by using, as the selection index,
the regression of genetic worth on phenotypic measurements
(see Cochran, 1951 and Henderson, 1963). But the regression
estimator may not be appropriate if the genetic worth of each
individual has to be assessed for other purposes which may
demand equal precision for the individual estimators.

Note 2. In his presidential address delivered to the Royal
Statistical Society, Finney (1974) suggested that the problem
of simultaneous estimation may be approached through the
principle of maximum likelihood, thus avoiding the use of the
arbitrary compound quadratic loss function. Let
$X_i \sim N(\theta_i, \phi)$, $i = 1, \ldots, p$, be p independent observations.
If the $\theta_i$ arise as a random sample from $N(\mu, \tau)$, then the
log likelihood is, apart from a constant,

$$L = -\frac{\sum (X_i - \theta_i)^2}{2\phi} - \frac{\sum (\theta_i - \mu)^2}{2\tau}. \tag{6.28}$$

Finney maximizes L with respect to $\mu$ and $\theta_i$ and obtains the
estimates

$$\hat\theta_i = \bar X + \left(1 - \frac{\phi}{\tau + \phi}\right)(X_i - \bar X). \tag{6.29}$$
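That (6.29) maximizes (6.28) can be checked numerically. In the sketch below, phi and tau denote the two variances and mu the mean of the random sample of parameters. The data, the variance values and the function names are illustrative assumptions, not taken from Finney's address.

```python
def log_lik(x, theta, mu, phi, tau):
    """Log likelihood (6.28), additive constants dropped."""
    fit = sum((xi - ti) ** 2 for xi, ti in zip(x, theta)) / (2.0 * phi)
    spread = sum((ti - mu) ** 2 for ti in theta) / (2.0 * tau)
    return -fit - spread


def joint_maximizer(x, phi, tau):
    """Closed-form maximizer: mu is the sample mean, theta_i is shrunk toward it."""
    xbar = sum(x) / len(x)
    weight = 1.0 - phi / (tau + phi)
    return xbar, [xbar + weight * (xi - xbar) for xi in x]


x = [1.2, -0.4, 2.5, 0.3, 3.1]      # made-up observations
phi, tau = 1.0, 2.0                 # assumed known variances
mu_hat, theta_hat = joint_maximizer(x, phi, tau)
best = log_lik(x, theta_hat, mu_hat, phi, tau)

# Any small perturbation of a single coordinate should lower the value.
for i in range(len(x)):
    for delta in (-0.01, 0.01):
        moved = list(theta_hat)
        moved[i] += delta
        print(i, delta, log_lik(x, moved, mu_hat, phi, tau) < best)
```

Since (6.28) is a strictly concave quadratic in the parameters and mu, a stationary point that beats every small perturbation is the unique maximum.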


It is not known whether the maximum likelihood principle
applies in situations such as (6.28) where the likelihood is
a function of both the unknown parameters and unobserved random
variables. Finney says that an unbiased estimate of $1/(\tau + \phi)$
is $(p-3)/\sum (X_i - \bar X)^2$, so that when $\tau$ is not known, the
estimator of $\theta_i$ is

$$\hat\theta_i = \bar X + \left[1 - \frac{(p-3)\phi}{\sum (X_i - \bar X)^2}\right](X_i - \bar X), \tag{6.30}$$

which is the same as the expression given by Lindley (1962)
using Bayes theorem and quadratic loss function.

However, if $\tau$ is unknown, the appropriate log likelihood
is, apart from a constant,

$$-\frac{p}{2}\log\tau - \frac{\sum (X_i - \theta_i)^2}{2\phi} - \frac{\sum (\theta_i - \mu)^2}{2\tau}. \tag{6.31}$$

The expression (6.31) can be made arbitrarily large by
choosing $\theta_i = \mu = \bar X$ for all i and letting $\tau \to 0$. Thus,
the m.l. estimator is $\hat\theta_i = \bar X$ for all i! Such anomalies do
occur when unobserved random variables are included as unknowns
and a "full likelihood" function such as (6.31) is considered for
drawing inference.
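The divergence can be checked numerically. The sketch below evaluates (6.31) with every parameter and mu set equal to the sample mean, for a decreasing sequence of values of tau. The data and the value of phi are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math


def full_log_lik(x, theta, mu, phi, tau):
    """Full log likelihood (6.31), additive constants dropped."""
    p = len(x)
    return (-0.5 * p * math.log(tau)
            - sum((xi - ti) ** 2 for xi, ti in zip(x, theta)) / (2.0 * phi)
            - sum((ti - mu) ** 2 for ti in theta) / (2.0 * tau))


x = [1.2, -0.4, 2.5, 0.3, 3.1]        # made-up observations
xbar = sum(x) / len(x)
collapsed = [xbar] * len(x)           # theta_i = mu = sample mean for all i

taus = (1.0, 1e-2, 1e-4, 1e-8)
values = [full_log_lik(x, collapsed, xbar, 1.0, tau) for tau in taus]
for tau, value in zip(taus, values):
    print(tau, value)
```

With all parameters collapsed onto the sample mean, the last term of (6.31) vanishes and the middle term is fixed, so the value grows like -(p/2) log tau as tau decreases.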


I wish to thank Robert Boudreau for the simulation studies
reported in the various tables.